ABSTRACT

This paper proposes an algorithm to improve the calculation
of confidence measures for spoken term detection (STD).
Given an input query term, the algorithm first calculates a
measurement named the document ranking weight for each
document in the speech database. This weight reflects the
relevance of the document to the query term and is obtained
by summing the confidence measures of all hypothesized term
occurrences in that document. The confidence measure of each
term occurrence is then re-estimated through linear
interpolation with the calculated document ranking weight,
which improves its reliability by integrating document-level
information. Experiments are conducted on three standard STD
tasks for Tamil, Vietnamese and English. The experimental
results demonstrate that the proposed algorithm achieves
consistent improvements over the state-of-the-art method for
confidence measure calculation. Furthermore, the algorithm
remains effective even when a high-accuracy speech recognizer
is not available, which makes it applicable to languages with
limited speech resources.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - search process, selection process;
I.2.7 [Artificial Intelligence]: Natural Language Processing
1. INTRODUCTION

Spoken term detection (STD) is a task designed for efficient
keyword search (given a text query) in a speech database,
and plays a central role in information management and
speech retrieval [12]. State-of-the-art STD approaches
include two subsystems. The first one is an automatic
speech recognizer (ASR), which is used to transcribe
the spoken utterances into text. The text transcriptions
contain all the possibly recognized words with corresponding
posterior probabilities [12][13]. The posterior probability,
as one typical confidence measure, plays a central
role in keyword searching. The second subsystem is a keyword
searcher, which returns the results of term detection
for each query term according to the decoded transcriptions.
Formally, in STD applications, a confidence measure (CM)
is defined to represent the reliability of each detected term
occurrence, and it is usually estimated by the recognizer
[12]. Relying on the confidence measure, the final term
detection results can be obtained by threshold-based recall.
However, when only limited training resources are available
for building the ASR system, the accuracy of the recognizer
and the reliability of the confidence measure are relatively
low, which makes it difficult to find correct query results in
the speech database.

This paper focuses on the calculation of confidence measures
for STD once the speech recognizer has been built.
In this situation, a one-pass retrieval candidate set can be
obtained for each query. Each candidate contains the location
information of the term occurrence and the corresponding
confidence measure. The baseline system of this paper can
then be evaluated on this set directly by conducting standard
score normalization and final decision. Several recent
efforts have attempted to improve the reliability of the
confidence measures of term occurrences and have achieved
some improvements on STD tasks. In [10], the confidence
measure of a query occurrence is re-estimated based on
context consistency information. A two-stage cascaded
machine learning approach has been proposed for rescoring
keyword search outputs for low-resource languages, as has a
modified logistic regression strategy for term detection
optimization. A discriminative score normalization method was
introduced to normalize confidence measures through
discriminative modeling. Moreover, another method employs
extra acoustic features to obtain a better confidence measure.

However, all these methods fail to utilize long-term contexts
at the document or topic level, which have proved to
be useful for some other information retrieval (IR) tasks
[15][23]. Clustering and latent topic models have also gained
improvements over traditional vector space models for IR
[21]. Besides, the well-known PageRank algorithm considers
the hyperlinks between every two pages and computes a
converged importance score for each page. Inspired by this
work, this paper proposes to integrate document ranking
information into the calculation of confidence measures of
term occurrences for spoken term detection. The document
ranking information is defined as the topic relevance
between the document and the query term. For each query
term, some documents tend to be more related
to it because they are of a similar topic. When examining
the accuracy of STD results, those topic-related documents
tend to contain more correct hits. In detail, this information
is quantified as a ranking weight for each document in
this paper. Based on the one-pass retrieval candidates for
a specific query term, we first sum up the confidence measures
of all term occurrences in each document. The document
ranking weights are then estimated by normalizing
these sums and are further integrated into the original
confidence measures through linear interpolation. Experiments
on three standard STD tasks demonstrate the effectiveness
of our proposed method.

The rest of this paper is organized as follows. Related
work is described in Section 2. The proposed algorithm for
confidence measure calculation is presented in Section 3.
Sections 4 and 5 give the experimental setup and results
on three standard STD tasks. Finally, we conclude our
work in Section 6.


2. RELATED WORK 

Some other work has attempted to utilize long-term
contexts for STD. One earlier study improved term detection
performance based on word burstiness in spoken
conversational corpora. More recently, [22][17] took advantage of
word repetition to improve spoken term detection, having
observed the phenomenon of word repetition within single
documents. They leveraged the burstiness of keywords by
taking the most confident keyword hypothesis in each
document and interpolating it with lower-scoring hits. Although
they designed an effective method to determine the
interpolation coefficients in their experiments, they focused on
intra-document term repetition, without paying attention to
inter-document contexts, e.g. the document ranking
information used in this paper. Another closely related work
also gave high priority to the candidate segments that are
included in highly ranked documents. However, it calculated
position-dependent document weights recursively. This paper
calculates document ranking weights in a simpler way and
considers the inter-document ranking information. In this
paper, we rank all documents in the speech database according
to their relevance to a specific query term and incorporate
such document ranking information into the calculation of
confidence measures.


3. PROPOSED METHOD 

For an input query term, a set of one-pass retrieval
candidates in the speech database is first generated following
the conventional STD approach. Each term detection
occurrence commonly contains location information and a
confidence measure, where the location information usually
includes the name (or ID) of the document, the start time and
the duration. For example, for term t, we use $O_i$ to
represent the location information of the i-th detection
occurrence of term t. If the location information indicates
that this occurrence candidate belongs to document d, then the
confidence measure of the i-th term detection occurrence
is denoted as $\mathrm{CM}_{\mathrm{base}}(t \mid O_i, d)$. We use the
subscript “base” to emphasize that this measure is obtained
from the one-pass retrieval candidate set. The confidence
measure is designed to describe the reliability of a detected
term occurrence, i.e., a correct query hit is expected to have
a high confidence measure. However, when the ASR subsystem
performs poorly, there may be many false alarms with
high confidence measures as well as correct candidates with
low confidence measures.

Algorithm 1 Calculate Document Ranking Weights Given a
Query Term

Input: The set of one-pass retrieval candidates given query
term t.

Output: The document ranking weights for all documents
in the database.

Main procedure:

1. Document Clustering

Group all the hypothesized occurrences of term t by
document and sum the confidence measures in each
document d:

$$S_d(t) = \sum_{O_i \in d} \mathrm{CM}_{\mathrm{base}}(t \mid O_i, d), \qquad (1)$$

where $S_d(t)$ can be viewed as the occurrence possibility
of term t in document d. The maximum score $S_{\max}(t)$
for term t is obtained by traversing all the documents:

$$S_{\max}(t) = \max_{d \in \text{all documents}} S_d(t). \qquad (2)$$

2. Document Ranking

The ranking weight $W_d(t)$ for each document is
calculated using the “relative-to-max” method, i.e.,
by dividing $S_d(t)$ by $S_{\max}(t)$:

$$W_d(t) = S_d(t) / S_{\max}(t). \qquad (3)$$

End

Based on the idea described in the introduction,
we propose to use document ranking information to
improve the calculation of confidence measures. The
algorithm to estimate the document ranking weight $W_d(t)$ for
an input term t is described in Algorithm 1. After the
calculation of document ranking weights, we re-estimate the
confidence measure of each occurrence by combining the original
one with the ranking weight of the document it belongs to.
In this work, linear interpolation is adopted:

$$\mathrm{CM}_{\mathrm{new}}(t \mid O_i, d) = \alpha W_d(t) + (1 - \alpha)\,\mathrm{CM}_{\mathrm{base}}(t \mid O_i, d), \qquad (4)$$

where the interpolation coefficient $\alpha$ is the same
for all query terms and can be tuned using a
development set. In short, the confidence re-estimation
algorithm can be divided into three steps: document
clustering, document ranking and confidence re-estimation.
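The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `rescore`, the representation of a candidate as a `(doc_id, cm_base)` pair, and the guard for a non-positive maximum are all hypothetical choices.

```python
from collections import defaultdict

def rescore(candidates, alpha):
    """Re-estimate confidences for the candidates of ONE query term.

    candidates: list of (doc_id, cm_base) pairs (hypothetical layout).
    alpha: interpolation coefficient in [0, 1].
    Returns a list of (doc_id, cm_new) pairs in the same order.
    """
    # Step 1, Eq. (1): sum the baseline confidences per document.
    doc_sum = defaultdict(float)
    for doc_id, cm_base in candidates:
        doc_sum[doc_id] += cm_base
    if not doc_sum:
        return []

    # Eq. (2): maximum per-document sum over all documents with hits.
    s_max = max(doc_sum.values())
    if s_max <= 0.0:
        # Assumption: leave scores untouched when no weight can be formed.
        return list(candidates)

    # Step 2, Eq. (3): "relative-to-max" ranking weight per document.
    weight = {doc_id: s / s_max for doc_id, s in doc_sum.items()}

    # Step 3, Eq. (4): linear interpolation with the baseline confidence.
    return [(doc_id, alpha * weight[doc_id] + (1.0 - alpha) * cm_base)
            for doc_id, cm_base in candidates]
```

Documents without any hypothesized occurrence of the term never appear in the candidate set, so they do not need an explicit weight in this sketch.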

4. EXPERIMENTAL SETUP 

4.1 Data Set and Evaluation Condition 

The experiments were conducted using three standard
spoken term detection tasks: the STD 2006 English
conversational telephone speech (CTS) evaluation set, the
OpenKWS 2013 Vietnamese development set and the OpenKWS
2014 Tamil development set^1. The English CTS evaluation set
included about 3 hours of speech, and the keyword set
consisted of 411 keywords. The development sets of Vietnamese
and Tamil each included about 10 hours of speech. The
evaluation keyword set for Vietnamese consisted of 4065
keywords, of which the 901 keywords appearing in the
development set were used in our experiments. For the Tamil
task, we used the kwlist3 keyword set supplied by IBM, which
consisted of 2375 keywords. The intention of using three tasks
was to evaluate the proposed algorithm using three very
different languages, with different ASR accuracy, different
amounts of training data and different sizes of keyword
sets. The evaluation criterion used in the experiments was
the Actual Term Weighted Value (ATWV) defined by NIST,
which uses a cost function of the false alarm probability
P(FA) and the miss probability P(Miss), averaged over a set
of queries^2.
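For readers unfamiliar with the metric, the following Python sketch computes a term-weighted value from per-term counts. It follows the NIST STD 2006 definition as we understand it (one trial per second of speech, weight beta = 999.9, terms without reference occurrences excluded from the average); none of these details are stated in this paper, and the function name `atwv` and the count layout are hypothetical.

```python
def atwv(per_term_counts, speech_seconds, beta=999.9):
    """Term-weighted value at the system's actual decision threshold.

    per_term_counts: iterable of (n_true, n_correct, n_false_alarm),
        one tuple per query term (hypothetical layout).
    speech_seconds: total duration of the searched speech in seconds.
    beta: cost/value weight; 999.9 is the NIST STD 2006 setting.
    """
    values = []
    for n_true, n_correct, n_false_alarm in per_term_counts:
        if n_true == 0:
            # P(Miss) is undefined, so the term is skipped.
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one trial per second minus the true hits.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        values.append(1.0 - p_miss - beta * p_fa)
    if not values:
        raise ValueError("no query term has reference occurrences")
    return sum(values) / len(values)
```

Because beta is large, a single false alarm costs far more than a single miss on short test sets, which is why the reliability of the confidence measure matters so much for this metric.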

4.2 Automatic Speech Recognizer 

Our ASR engines were built using DNN-HMM based
acoustic modeling, which is the state-of-the-art approach for
speech recognition [18].

For the English task, 309 hours of Switchboard speech
were used to train the acoustic model, and the transcriptions
of these speech files were used to train a 3-gram language
model. The cross-entropy criterion was used to train the
DNN models. The word accuracy (ACC) of the ASR system
on the evaluation set was 77.67%.

For the Vietnamese recognizer, two approaches were adopted
to prevent over-fitting in DNN training, since
the training corpus contains only about 70 hours of speech.
The first approach was cross-lingual training, where we used
a DNN model acquired from 1000 hours of Chinese CTS
data to initialize the Vietnamese DNN parameters. The second
approach was to replace the sigmoid function in the DNN
model with the rectified linear unit (ReLU) activation
function. The transcripts of the Vietnamese training files were
then used to train a 2-gram language model. A word ACC
of 45.76% was achieved on the development set. The strategy
employed for the Tamil ASR engine was similar to that
used for Vietnamese. The only difference was that the
sequence training algorithm was applied in the DNN training
for Tamil. A word ACC of 31.03% was achieved on the
development set.

4.3 STD Indexer and Keyword Searcher 

We designed a toolkit named iSTD to build our keyword
search subsystem for STD. We followed the work in [12][13]
to construct the inverted index based on confusion networks.
The term occurrence candidates were then found by
keyword searching on the inverted index. The confidence
re-estimation algorithm proposed in this paper was also
integrated into this toolkit.
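The internal layout of the iSTD index is not described here. Purely as an illustration of the idea, a word-level inverted index over confusion networks might look like the following Python sketch. The data layout (each slot as a `(start, duration, {word: posterior})` tuple) and the names `build_index` and `search` are hypothetical, and only single-word terms are handled.

```python
from collections import defaultdict

def build_index(confusion_networks):
    """Map each word to the list of places where it was hypothesized.

    confusion_networks: {doc_id: [(start, duration, {word: posterior}), ...]}
    Returns {word: [(doc_id, start, duration, posterior), ...]}.
    """
    index = defaultdict(list)
    for doc_id, slots in confusion_networks.items():
        for start, duration, arcs in slots:
            # Every competing word in a slot becomes a posting,
            # with its posterior as the baseline confidence.
            for word, posterior in arcs.items():
                index[word].append((doc_id, start, duration, posterior))
    return index

def search(index, term):
    """Return the one-pass retrieval candidates for a single-word term."""
    return list(index.get(term, []))
```

A multi-word term would additionally require matching postings in adjacent slots of the same document, which this sketch omits.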

5. EXPERIMENTAL RESULTS 
5.1 Effectiveness of Document Ranking 


^1 http://www.nist.gov/itl/iad/mig/openkws.cfm

^2 http://www.itl.nist.gov/iad/mig/tests/std/2006/docs/std06-evalplan-v10.pdf



Figure 1: Correlation Curve based on Document Ranking
(two panels; horizontal axis: Document Ranking Position).

In order to validate the rationale for applying document
ranking information to STD tasks, we examined the
relationship between the performance of term detection and
the document ranking positions. Here, the document ranking
positions were derived by sorting all documents in
descending order of the weights calculated following Algorithm
1. Figure 1 shows the correlation curves for the
aforementioned Vietnamese STD task. The results were obtained by
averaging over 901 query keywords. The correlation curves
reveal that documents with high document ranking weights
usually have high precision and recall of term detection.

In addition, we calculated the Spearman rank correlation
coefficient between the two performance measurements
of term detection (precision and recall) and the document
ranking weights on the three STD tasks. The results are given
in Table 1 and show the existence of high correlations. All
these results indicate that the document ranking information
is strongly correlated with the STD performance and that it
is reasonable to integrate it into the calculation of
confidence measures for term detection.
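As a reminder of how the Spearman rank correlation coefficient is obtained, the following self-contained Python sketch ranks both sequences (averaging ranks over ties) and computes the Pearson correlation of the ranks. The function names are hypothetical, and the sketch does not reproduce the exact procedure used to pair ranking weights with precision and recall in the experiments.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A value close to 1 means that documents with larger ranking weights almost always have better detection performance, regardless of whether the relationship is linear.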


Table 1: Spearman correlation for three STD Tasks.

STD Task              Spearman Correlation
Language     ACC      Precision-Rank   Recall-Rank
English      78%      0.93             0.74
Vietnamese   46%      0.74             0.73
Tamil        31%      0.70             0.68


5.2 Results of Tuning Interpolation Coefficients 

The interpolation coefficient $\alpha$ in (4) controls the
balance between the document weights and the baseline
confidence measures for a specific query term. To explore its
practical effect, the ATWVs on the development set of the
Vietnamese STD task for different interpolation coefficients
are depicted in Fig. 2. We can see that a reasonable choice
for $\alpha$ is within the range 0.05 to 0.4. In the next
section, experimental results are presented for the different
tasks, where $\alpha$ was tuned on the development set and set
to 0.05, 0.1 and 0.15 for Tamil, Vietnamese and English
respectively.
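Tuning the coefficient amounts to a one-dimensional grid search on development data. The Python sketch below shows one way to organize it; the function name `tune_alpha`, its arguments, and the idea of passing the rescoring and scoring routines as callables are assumptions made for illustration, not a description of the actual tuning script.

```python
def tune_alpha(rescore_fn, evaluate_fn, dev_candidates, grid):
    """Pick the interpolation coefficient that maximizes a dev-set score.

    rescore_fn(candidates, alpha): returns rescored candidates for one term.
    evaluate_fn(rescored_by_term): returns a score to maximize (e.g. ATWV).
    dev_candidates: {term: candidates} for the development set.
    grid: iterable of candidate alpha values.
    """
    best_alpha, best_score = None, float("-inf")
    for alpha in grid:
        # The same alpha is applied to every query term.
        rescored = {term: rescore_fn(cands, alpha)
                    for term, cands in dev_candidates.items()}
        score = evaluate_fn(rescored)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

A grid such as 0.0, 0.05, ..., 0.5 would cover the range discussed above; including 0.0 keeps the baseline system among the options.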

5.3 Results of STD Tasks 

We compared the proposed confidence measure re-estimation
algorithm with the baseline system for the three STD tasks.
The baseline system directly adopted the ASR posterior
score as the confidence measure for each query term. A
keyword-specific threshold was applied in all systems as the
final decision method. Experimental results are listed
in Table 2. We can see that the proposed confidence
re-estimation approach achieves consistent improvements for
all three speech retrieval tasks. Considering the
amount of training data available in these three tasks, the
results in Table 2 also indicate that the proposed confidence
re-estimation method is neither language-dependent nor
sensitive to the amount of training resources.

Figure 2: Effect of different interpolation coefficients
(horizontal axis: interpolation coefficient).

Table 2: Term Detection Results for Three Tasks (ASR
recognition accuracy: English=78%, Vietnamese=46%,
Tamil=31%).

Language     Confidence   ATWV              P(Miss)
English      Baseline     0.8064            0.142
             Proposed     0.8182 (+1.5%)    0.119
Vietnamese   Baseline     0.3661            0.583
             Proposed     0.3779 (+3.2%)    0.565
Tamil        Baseline     0.2785            0.661
             Proposed     0.2934 (+5.4%)    0.626
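The percentages reported next to the ATWV values in Table 2 are relative gains over the baseline. The short Python check below recomputes them from the tabulated ATWVs; the helper name `relative_gain` is ours.

```python
def relative_gain(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline

# (baseline ATWV, proposed ATWV) as listed in Table 2.
table2 = {
    "English": (0.8064, 0.8182),
    "Vietnamese": (0.3661, 0.3779),
    "Tamil": (0.2785, 0.2934),
}

gains = {lang: round(relative_gain(base, prop), 1)
         for lang, (base, prop) in table2.items()}
print(gains)
```

The absolute ATWV gains are of similar size across the tasks (0.0118, 0.0118 and 0.0149), so the relative gain grows as the baseline ATWV falls.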

6. CONCLUSIONS 

This paper has presented an algorithm to improve the
calculation of confidence measures for spoken term detection.
Inspired by the PageRank algorithm and the application of
language models in the text information retrieval area, we
proposed to integrate document ranking information into
the calculation of confidence measures for term occurrences.
The document ranking information indicates the topic
relevance between each document and the query term, and
topic-related documents are expected to contain more correct
hits. Experiments on three standard STD tasks demonstrate
the effectiveness of introducing document ranking
information with this algorithm.